In-silico folding of a three helix protein and characterization of its free-energy 

landscape in an all-atom forcefield 
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We report the reproducible first-principles folding of the 40 amino acid, three-helix headpiece of 
the HIV accessory protein in a recently developed all-atom free-energy forcefield. Six of twenty 
simulations using an adapted basin-hopping method converged to better than 3 A backbone RMS 
deviation to the experimental structure. Using over 60,000 low-energy conformations of this protein, 
we constructed a decoy tree that completely characterizes its folding funnel. 

PACS numbers: 87.15.Cc,02.70.Ns,02.60.Pn 



Available genomic and sequence information for pro- 
teins contains a wealth of biomedical information that be- 
comes accessible when translated into three-dimensional 
structure . While theoretical models for protein struc- 
ture prediction 0, Q that partially rely on experimen- 
tal information have shown consistent progress the 
assessment of de-novo strategies that rely on sequence 
information alone has been much less favorable [5j. The 
development of such techniques, in particular of transfer- 
able first-principle all- atom folding methods, would sig- 
nificantly benefit the understanding of protein families 
where little experimental information is available, the 
prediction of novel folds as well as the investigation of 
protein association and dynamics which are presently dif- 
ficult to probe experimentally. Recent progress for small 
peptides |2l ul |2L ul documents both the feasibility of this 
approach as well as its limitations 0, 0] , in particualr 
those associated with the direct simulation of the folding 
process through molecular dynamics 1 1] . 

Overwhelming experimental evidence supports the 
thermodynamic hypothesis 12] that many proteins are 
in thermodynamic equilibrium with their environment: 
their native state thus corresponds to the global mini- 
mum of their free energy landscape |13l | . The free energy 
of the system is accessible either indirectly by explicit 
ensemble averaging of the combined internal energy of 
protein and solvent, or directly in a free-energy force- 
field where an implicit solvation model approximates di- 
rect interactions with the solvent as well as most of the 
entropic contributions. We de velop ed an all-atom pro- 
tein forcefield (PFF01) H [H El] with an area-based 
implicit solvent model that approximates the free energy 
of peptide conformations in the natural solvent. Using 
a rational decoy approach this forcefield was optimized 
to correctly predict the native structure of the 36-amino 
acid headgroup of villin 0, 0, 0] . Without further pa- 
rameter adjustment we then simulated the structurally 
conserved 40 amino-acid headpiece of the autonomously 
folding HIV accessory protein (1F4I-40) [l£| with a mod- 
ified basin hopping technique 0, ^g]- Out of twenty 
simulations the five energetically lowest correctly repro- 



duced the NMR structure of this three-helix protein with 
a backbone RMS deviation of less than 3 A. The combina- 
tion of decoy based model development for the free energy 
with efficient stochastic optimization methods suggests a 
viable route for protein structure prediction at the all 
atom level with present day computational resources. 

Model: We have recently developed an all-atom pro- 
tein forcefield (PFF01), which was used to reproducibly 
fold the 20 amino acid trp-cage protein ||. PFF01 
comprises an atomically resolved electrostatic model with 
group specific dielectric constants and a Lennard Jones 
parameterization that was adapted to the experimental 
distance distributions from crystal structures of 138 pro- 
teins |lfll | . Interaction with the fictitious solvent are mod- 
eled in a simple solvent accessible surface approach [20j, 
where the solvation free-energies per unit surface were fit- 
ted to the enthalpies of solvation of the Gly-X-Gly series 
of peptides |2J| . The only low-energy degrees of freedom 
available to the peptide during the folding process are 
rotations of the dihedral angles of the backbone and the 
sidechains, these are the only moves considered during 
the simulation. There are two move-classes, small ran- 
dom rotations about a single angle and library moves, 
which set a particular backbone dihedral to a permitted 
value in the Ramachandran plot. 

If an accurate model for the free energy of the pro- 
tein in its environment is available, stochastic optimiza- 
tion methods can be used to locate the global optimum 
of the free-energy landscape orders of magnitude faster 
than traditional simulation techniques. We adapted the 
basin hopping technique E3 (BHT) for protein simula- 
tions by replacing a single minimization step with a simu- 
lated annealing run [22j with self-adapting cooling cycle 
and length. At the end of one annealing step the new 
conformation was accepted if its energy difference to the 
current configuration was no higher than a given thresh- 
old energy et, an approach recently proven optimal for 
certain optimization problems |23l | . Each simulation was 
performed in three separate steps: First we used a high 
temperature bracket of 800/300 K (et =15 K) for the an- 
nealing window and reduced the strength of the solvent 
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Name 


RMSB 


Energy 


Secondary Structure Content 


N 


0.00 




c cHHHHHHHHH c 1 cbHHHHHHHHHH c 1 c c cHHHHHHHHH c 


D01 


2.34 


-119.54 


cHHHHHHHHHHHlcbcHHHHHHHHHHHHbHHHHHHHHHHc 


D02 


2.41 


-117.52 


cHHHHHHHHHHHlcbHHHHHHHHHHHHHbHHHHHHHHHHc 


D03 


2.76 


-116.25 


cHHHHHHHHHHHlcbHHHHHHHHHHHHHbHHHHHHHHHHc 


D04 


2.40 


-115.85 


cHHHHHHHHHHHlbbHHHHHHHHHHHHHbHHHHHHHHHHc 


D05 


2.43 


-114.67 


cHHHHHHHHHHHlcbHHHHHHHHHHHcbHHHHHHHHHHHc 


D06 


6.48 


-114.06 


cHHHHHHHHHHHcccbHHHHHHHHHHHHbHHHHHHHHHHc 


D07 


2.57 


-113.65 


cHHHHHHHHHHHlbbcHHHHHHHHHHHHbHHHHHHHHHHc 


D08 


4.61 


-107.72 


cHHHHHHHHHc cl c cHHHHHHHHHHHHHl c lHHHHHHHHc 


D09 


4.14 


-106.29 


cHHHHHHHHHHHcbcbHHHHHHHHHbblcHHHHHHHHHHc 


D10 


5.92 


-103.88 


cHHHHHHHHHHHl cHHHHHHHHHb cb c c lbHHHHHHHHH c 



TABLE I: Table of the 10 lowest energy decoys (of 20, remainder available by request from the authors) with backbone RMS 
deviation to the NMR structure and secondary structure content. The first row designates the secondary structure content of 
the NMR structure. 



terms in the forcefield by 20%. The second step started 
from the final configurations of the first run, used the 
same annealing window but the full solvent interactions. 
In the third step the resulting structures were further an- 
nealed in a low temperature bracket of 600/3 K (£t=1K). 
Within each annealing run the temperature was geomet- 
rically decreased, also the number of steps per annealing 
run was gradually increased to ensure better convergence. 
In total each simulation comprised 10 7 energy evaluations 
with a computational effort roughly corresponding to a 
10 ns MD simulation (in vacuum with 1 fs timestep) . Af- 
ter this time no significant energy fluctuations occurred 
in the simulations, indicating that each had settled into 
a metastable configuration. 

Results: Using PFF01 we performed 20 independent 
modified basin hopping simulations of the structurally 
conserved 40 amino-acid headpiece of the HIV accessory 
protein (pdb-code 1F4I, sequence: QEKEAIERLK AL- 
GFEESLVI QAYFACEKNE NLAANFLLSQ). The best 
structures found in each run were ranked according to 
their energy and the backbone RMS deviation (RMSB) 
to the NMR structure was computed. Table JIJ demon- 
strates that the five lowest structures had to good accu- 
racy converged to the NMR structure of the protein. The 
first non-native decoy appears in position six, with an 
energy deviation of 5 kcal/mol (in our model) and a sig- 
nificant RMSB deviation. The table demonstrates that 
all low-energy structures have essentially the same sec- 
ondary structure, i.e. position and length of the helices 
are always correctly predicted, even if the protein did not 
fold correctly. The degree of secondary structure content 
and similarity decreases for the decoys with higher energy 
(data not shown) in good correlation with their energy. 
From the standpoint of the optimization approach (but 
not necessarily for the physical folding scenario) this sug- 
gests that successful folding requires the formation of the 



correct secondary structure content as a prerequisite. 

The good agreement between the folded and the ex- 
perimental structure is also evident from Figure 
which shows the secondary structure alignment of the 
native and the folded conformations. The good physi- 
cal alignment of the helices illustrates the importance of 
hydrophobic contacts to correctly fold this protein. An 
independent measure to assess the quality of these con- 
tacts is to compare the Cp-Cp distances (which corre- 
spond to the NOE constraints of the NMR experiments 
that determine tertiary structure) in the folded structure 




FIG. 1: Overlay of the secondary structure elements of the 
native configuration and the folded structure of 1F4I-40 
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FIG. 2: Color coded Cp-Cp distance error map for the folded 
structure 1F4I in comparison to the NMR structure. Each 
square encodes the deviation between the Cp-Cp distance of 
two amino acids in the NMR to the C/3-C/3 distance of the 
same amino acids in folded structure. Black (grey) squares in- 
dicate a deviation of less than 1.50 A (2.25 A). White squares 
indicate large deviations. 



to those of the native structure. The color coded C/3-C/3 
distance in Figure J2J) demonstrates a 66 % (80 %) co- 
incidence of the Cp-Cp distance distances to within one 
(1.5) standard deviations of the experimental resolution. 
The dark diagonal block indicate intra-helical contacts, 
which are, perhaps not too surprisingly, resolved to very 
good accuracy. The off-diagonal dark blocks, however, 
indicate that also a large fraction of long range native 
contacts is reproduced correctly. 

Starting from intermediate structures of the folding 
simulations we generated over 60,000 low-energy confor- 
mations (decoys). Decoys with a root mean square de- 
viation of the backbone (RMSB) deviation of less than 
3 A were grouped into families with free energy brack- 
ets of 2 kcal/mol. We then resolved the topological 
hierarchy^^ |3| of the associated potential energy sur- 
face through the construction of a decoy tree (Figure ®) 
that illustrates the low-energy structure of the free en- 
ergy surface. Beginning from the best conformation, we 
draw a vertical line for each decoy family in this window. 
Moving upward in energy the number of decoys in each 
family grows almost exponentially in the low energy re- 
gion which we can resolve well. As a result the diversity 
of each family grows until different families unite. Family 
membership is associative, i.e. as soon as two decoys in 
different branches have an RMSB deviation of less than 3 
A, all members of both families belong to one superfam- 
ily. Pictorially this representation results in an inverted 
tree-like structure that characterizes the topology of the 
metastable states of the free energy surface. 

Trees with very short stems and many low-energy 
branches are characteristic of glassy potential energy 



surfaces, which are associated with Levinthals para- 
dox [23, |26( in the context of protein folding. Well struc- 
tured trees with few terminal branches suggest the exis- 
tence of a folding funnel 0], consistent with the "new" 
paradigm for protein folding [27|]. From this perspec- 
tive the structure in Figure 10 this tree is indicative of 
the existence of a very broad folding funnel with pro- 
nounced competing secondary metastable conformations, 
which are depicted as the non-native terminal minima of 
the tree. The discretization of the energy axis in inter- 
vals of 2 kcal/mol starting from the native conformation 
results in a smoothing of the free-energy surface. Each 
line in the figure corresponds to a family containing hun- 
dreds tothousands of structures, which are all associated 
with the same low-lying metastable conformation (the 
terminal point of the branch). Simulations trapped in 
such a metastable state must overcome a potential en- 
ergy barrier of the order of the energy difference to the 
next highest branching point of the tree to visit another 
structure. The branching points of the tree were con- 
structed only from structures of the decoy set and not 
through independent transition state search among the 
terminal structures. In addition, main-chain entropy is 
neglected in this analysis, which results in an overesti- 
mation of the barrier. Further investigations to more 
accurately determine the transition states are presently 
under way. 

The lowest competing terminal branch (branch C), 
associated with decoy D06 in Tabic JIJ, is less than 5 
kcal/mol above the best native decoy. Decoy D06 has 
comparable energy to competing decoys but much larger 
RMSB deviation and has few long-range native contacts. 
This structure (see Figure (@J) has also three helices 
of comparable length and sequence location, but differs 
from the native structure in the relative alignment of the 
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FIG. 3: Decoy tree of the low energy configurations of the 40 
amino acid headpiece of th HIV accessory protein. The hori- 
zontal axis depicts the total energy the chart on the right the 
total number of decoys in the primary and secondary funnels. 
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FIG. 4: Overlays of the secondary structure elements of the 
native (green) configuration and the lowest misfolded decoy 
(red) of 1F4I. 

helices with respect to each. The RMSB deviation of the 
low-lying terminal structures to the NMR structures is 
large (i.e. comparable with the RMSB deviation to un- 
folded structures), indicating that conserved secondary 
structure elements, rather than distance constraints char- 
acterize the folding funnel. The low lying local minima 
are thus characterized by varying spatial arrangements of 
similar secondary structure elements, a property that is 
ill represented by either RMSB deviation or the number 
of native contacts. 

Discussion: With this work we have demonstrated that 
accurate free-energy forcefields can predict the native 
structure of proteins with nontrivial tertiary structure 
with present day computational resources. This result 
represents an in-silico realization of the thirty year-old 
thermodynamic hypothesis that proteins are in thermo- 
dynamic equilibrium with the environment under physio- 
logical conditions 0]. Under this hypothesis, the native 
structure of the protein can be predicted using stochastic 
optimization methods orders of magnitude faster than by 
direct simulation. Our results demonstrate that the im- 
portant influence of the solvent can be modeled with a 
relatively simple solvent accessible surface approach. 

The analysis of the free energy surface supports the 
funnel paradigm of protein folding for a nontrivial pro- 
tein with a significantly larger hydrophobic core than was 
previously possible. The relatively small number of ter- 
minal branches of the decoy tree offers the first glimpse 
on the experimentally inaccesible structure of the folding 
funnel. It suggest the exsitence of a broad folding fun- 
nel with well defined secondary metastable states which 
may constitute important folding intermediates. The free 
energy optimization approach used here permitted the 
characterization of these low-lying states, which surpris- 



ingly share very similar secondary structure with the na- 
tive configurations. Investigations of other proteins must 
show, whether this pattern persist also for other proteins. 

It should be noted that the success of the optimiza- 
tion approach depends strongly on the ability of the opti- 
mization technique to differentiate between the low-lying 
minima of the FES in a realistic forcefield. The perfor- 
mance of optimization methods is often strongly problem 
dependent, but with 1F4I, 1VII and 1L2Y three nontriv- 
ial model systems exist on which different optimization 
methods can be evaluated. The decoy sets generated and 
insights regarding low-lying metastable states can also 
serve as a test-bed for the development of coarse-grained 
protein models. PFF01 is presently be validated for other 
peptides and proteins and rationally evolved to correctly 
predict the structure of larger fold families. 
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